SpatialWavePredict: a tutorial-based primer and toolbox for forecasting growth trajectories using the ensemble spatial wave sub-epidemic modeling framework

Background Dynamical mathematical models defined by a system of differential equations are typically not easily accessible to non-experts. However, forecasts based on these types of models can help gain insights into the mechanisms driving the process and may outcompete simpler phenomenological growth models. Here we introduce a friendly toolbox, SpatialWavePredict, to characterize and forecast the spatial wave sub-epidemic model, which captures diverse wave dynamics by aggregating multiple asynchronous growth processes and has outperformed simpler phenomenological growth models in short-term forecasts of various infectious diseases outbreaks including SARS, Ebola, and the early waves of the COVID-19 pandemic in the US. Results This tutorial-based primer introduces and illustrates a user-friendly MATLAB toolbox for fitting and forecasting time-series trajectories using an ensemble spatial wave sub-epidemic model based on ordinary differential equations. Scientists, policymakers, and students can use the toolbox to conduct real-time short-term forecasts. The five-parameter epidemic wave model in the toolbox aggregates linked overlapping sub-epidemics and captures a rich spectrum of epidemic wave dynamics, including oscillatory wave behavior and plateaus. An ensemble strategy aims to improve forecasting performance by combining the resulting top-ranked models. The toolbox provides a tutorial for forecasting time-series trajectories, including the full uncertainty distribution derived through parametric bootstrapping, which is needed to construct prediction intervals and evaluate their accuracy. Functions are available to assess forecasting performance, estimation methods, error structures in the data, and forecasting horizons. The toolbox also includes functions to quantify forecasting performance using metrics that evaluate point and distributional forecasts, including the weighted interval score. Conclusions We have developed the first comprehensive toolbox to characterize and forecast time-series data using an ensemble spatial wave sub-epidemic wave model. As an epidemic situation or contagion occurs, the tools presented in this tutorial can facilitate policymakers to guide the implementation of containment strategies and assess the impact of control interventions. We demonstrate the functionality of the toolbox with examples, including a tutorial video, and is illustrated using daily data on the COVID-19 pandemic in the USA. Supplementary Information The online version contains supplementary material available at 10.1186/s12874-024-02241-2.


Background
Developing reliable methods for forecasting dynamic growth processes is critical for decision-making in problems ranging from predicting the weather, forecasting the trajectory of an emerging epidemic, the growth or decline of economic variables, election outcomes, and sporting events [1].While statistical methods such as ARIMA and exponential smoothing are robust and broadly competitive for forecasting time series [2][3][4][5][6], dynamical mathematical models defined by a system of differential equations are typically not easily accessible to non-experts.However, forecasts based on these types of models can help characterize the mechanisms driving the process [7].They may offer higher forecasting performance than purely statistical approaches based on statistical evaluation criteria like mean absolute and squared errors [8][9][10][11].Here we focus on dynamical models that can characterize growth processes that give rise to waves of variable shapes and sizes [12][13][14].The complexity of this family of growth models ranges from single differential equation models with a few parameters, such as the 3-parameter generalized-logistic growth model (GLM) [14], to systems of ordinary differential equations (ODEs) that capture diverse wave dynamics by aggregating multiple asynchronous growth processes [13].The spatial wave sub-epidemic framework has outperformed simpler phenomenological growth models in forecasts of various infectious diseases, including severe acute respiratory syndrome (SARS), Ebola, and the early waves of the coronavirus disease 2019 (COVID-19) pandemic in the United States (US) [13,15].
This tutorial paper introduces a user-friendly MAT-LAB toolbox to fit and forecast time-series trajectories using the spatial wave sub-epidemic dynamic growth model based on ordinary differential equations, which was initially developed to characterize and derive shortterm forecasts of epidemic trajectories [13,16].This mathematical framework characterizes time-series trajectories by aggregating multiple asynchronous growth processes.Each growth process (i.e., sub-epidemic) is modeled using a simple phenomenological growth model such as the generalized logistic growth model (GLM).This framework supports a family of growth models that yield similar fits to the calibration data, but their corresponding forecasts could produce diverse trajectories.Hence, we also incorporate ensemble techniques to combine the resulting models to boost forecasting performance [16,17].
This toolbox is written for a diverse audience, including students training in time-series forecasting.It allows the user to conduct parameter estimation and forecasting with quantified uncertainty and evaluate forecasting performance using a set of standard metrics, including the coverage of the 95% prediction interval and the weighted interval score, which account for the uncertainty of the predictions.The toolbox allows scientists and policymakers to generate short-term forecasts by relying on minimal data of the process of interest, such as an unfolding epidemic or natural disaster.
The toolbox provides prediction intervals and allows the user to employ different estimation methods, assumptions of the error structure, and forecasting horizons.For instance, the toolbox includes estimation methods such as the nonlinear least squares estimation and maximum likelihood estimation (MLE) with different assumptions about the error structure of the observed data, including Poisson, negative binomial, and normal distributions, as well as quantification of the uncertainty based on a parametric bootstrapping approach.The model also provides flexibility to choose the underlying building block of the growth process.In addition, the toolbox includes functions to derive weighted and unweighted ensembles based on the resulting top-ranked models.The full functionality of the toolbox is illustrated using daily time series of COVID-19 cases in the US, and in the process, shows that this framework outcompetes simpler single growth models and simple time-series models (e.g., ARIMA, GAM, SLR) in calibration and forecasting performance.
We start by describing the format of the input timeseries data, followed by the methods employed for parameter estimation.Next, we describe the underlying methodology, user parameters, and functions to calibrate, evaluate, and display the model fits.Finally, we introduce the functions to generate, display, and quantify the performance of model-based forecasts with specific examples in the context of the daily COVID-19 case data reported in the USA.A tutorial video that demonstrates the toolbox functionality is available at: https:// www.youtu be.com/ watch?v= qxuF_ tTzcR 8&t= 47s.

Implementation
In this section, we describe the methods implemented in this toolbox and provide a brief overview of the toolbox functions.

Installing the toolbox
• Download the MATLAB code located in the folder spatialWave_subepidemicFramework code from the GitHub repository: https:// github.com/ gchow ell/ spati al_ wave_ subep idemic_ frame work.• Create an 'input' folder in your working directory where your input data will be stored.• Create an 'output' folder in your working directory where the output files will be stored.
• Open a MATLAB session.

Overview of the toolbox functions
The methodological workflow of the tutorial is organized as follows: (1) plotting model simulations, (2) fitting the models to data with quantified uncertainty, (3) plotting the resulting model fits and calibration performance metrics, and (4) plotting model-based forecasts and the associated forecasting performance metrics.Table 1 and Supplementary Table 1 list the names of both user and internal functions associated with the toolbox, along with a brief description of their role.As described below, the user needs to specify the parameters related to model fitting and forecasting in the default options_fit.mand options_forecast.mfiles.

Parameter estimation method
Let f (t, �) denote the expected curve of the epidem- ic's trajectory.We can estimate model parameters by fitting the model solution to the observed data via nonlinear least squares [18] or maximum likelihood estimation with specific assumptions about the error structure in the data [19] by specifying parameter <method1> in the options.mfile.For nonlinear least squares (i.e., <method1>=0), this is achieved by searching for the set of parameters Θ that minimizes the sum of squared differences between the smoothed data y t j = y t 1 , y t 2 . . ..y t n d and the model mean, corre- sponding to f (t, �) .That is, � = (C thr , r, p, q, K 0 ) in the sub-epidemic wave model (given below) is estimated by � = argmin 2 .We estimate the parameter C thr through simple discretization of its range of plausible values.Our estimation procedure consists of two steps.First, for each C thr , we search for the set of parameters (r, p, q, K 0 ) that yield the best fit to the data.Then we choose C thr and the corresponding estimates of other parameters leading to the overall best-fit to the data.Nonlinear least squares estimation weighs each of the data points equally and does not explicitly require a specific distributional assumption for y t , except for the first moment E[y t ] = f (t i ; Θ) .That is, the mean of the observed data at time t is equivalent to the expected count denoted by f (t, �) at time t [20].This method yields asymptotically unbiased point estimates regardless of any misspecification of the variance-covariance error structure.Hence, the estimated model mean f (t i , �) yields the best fit to observed data y t i in terms of squared L2 norm.We can solve the nonlinear least squares optimization problem using the fmincon function in MAT-LAB.Moreover, we also employ MATLAB's MultiStart feature to specify the number of random initial guesses of the model parameters using the parameter <numstart-points> in the options.mfile in order to search thoroughly for a global minimum, check that the solution is unique, and the parameters are identifiable.
We can also estimate parameters via maximum likelihood estimation (MLE) [19] and assume different error structures in the data (e.g., Poisson, negative binomial).The log-likelihood expressions derived for different error structures are specified below.

Poisson
For a Poisson error structure, the full log-likelihood of Poisson (i.e., <method1>=1) is given by:

options_forecast.m
Specifies the parameters related to the forecast, including the forecasting period, the type of ensemble weight for the ensemble models, and whether the forecasts will be evaluated.The structure of the options_forecast.mfile is given in Supplementary Text 2.

plot_SW_subepidemic.m
Plots simulations of the spatial wave sub-epidemic model.

Run_SW_subepidemicFramework.m
Derives the top-ranking sub-epidemic wave models to data with quantified uncertainty.plotRankings_SW_ subepidemicFramework.m Plots the mean model fits of the top-ranking models, including their sub-epidemic profiles, and the associated quality of model fit metrics, including the AICc, the relative likelihood, and the evidence ratio.plotFit_SW_ subepidemicFramework.m Displays the model fit and 95% prediction interval, as well as the empirical distribution of the parameters.It also saves output .csvfiles in the output folder with the model fit, the parameter estimates, including 95% CIs, and the calibration performance metrics.plotForecast_SW_ subepidemicFramework.m Displays the model-based forecast and the performance metrics of the forecast.Moreover, the data associated with the forecasts, the parameter estimates, as well as the calibration and forecasting performance metrics are saved as .csvfiles in the output folder.
where µ i = f (t i , θ) denotes the mean of y i at time t i .The number of parameters is just the number of parameters estimated in the dynamical model based on ordinary differential equations.

Negative binomial
Let r > 0 denote the number of failures until the experi- ment is stopped, p ∈ [0, 1] denote the success probability in each experiment.The number of successes y before the r-th failure occurs has a negative binomial distribution given by: For n observations y 1 , . . ., y n , the full log-likelihood is which can be expressed with µ and σ 2 by plugging-in p = 1 − µ σ 2 and r = µ 2 σ 2 −µ .There are different types of variances commonly used in a negative binomial distribution.If the variance scales linearly with the mean: σ 2 = µ + αµ , (i.e., <method1> = 3 in options.m),then p = α 1+α and r = µ/α .Let µ = f (t, θ) be the mean curve to be esti- mated from the differential equation.The full log-likelihood (1.1) can be expressed as follows: If the variance scales quadratically with the mean, σ 2 = µ + αµ 2 (i.e., <method1>=4 in options.m),then p = αµ 1+αµ and r = 1/α .The full log-likelihood (1.1) can be expressed as follows: The more general form of variance is σ 2 = µ + αµ d (i.e., <method1>=5 in options.m)with any −∞ < d < ∞ .Then the full log-likelihood (1.1) can be expressed as follows: where The number of parameters is 1 plus the number of parameters in the dynamical model based on ordinary differential equations (ODE) for (1.2) ~ (1.3), and 2 plus the number of parameters in the dynamical model for (1.4)

Parametric bootstrapping
To quantify parameter uncertainty, we follow a parametric bootstrapping approach which allows the computation of standard errors and related statistics in the absence of closed-form formulas [21].We generate B bootstrap samples from the best-fit model f (t, Θ) , with an assumed error structure specified using parameter <dist1> in the options.mfile to quantify the uncertainty of the parameter estimates and construct confidence intervals.Typically, the error structure in the data is modeled using a probability model such as the Poisson or negative binomial distribution.Using nonlinear least squares (< method1> = 0), besides a normally distributed error structure (<dist1> = 0), we can also assume a Poisson (<dist1> = 1) or a negative binomial distribution (<dist1> = 2) whereby the variance-tomean ratio is empirically estimated from the time series.
To estimate this constant ratio, we group a fixed number of observations (e.g., 7 observations for daily data into a bin across time), calculate the mean and variance for each bin, and then estimate a constant variance-to-mean ratio by calculating the average of the variance-to-mean ratios over these bins.
Using the best-fit model f (t, Θ) , we generate B-times replicated simulated datasets of size n d , where the obser- vation at time t j is sampled from the corresponding distri- bution specified by <dist1>.Next, we refit the model to each of the B simulated datasets to re-estimate the parameters using the same estimation method for the bootstrap sample as for the original data.This allows us to quantify the uncertainty of the estimate using that method.The new parameter estimates for each realization are denoted by b , where b = 1,2, . . ., B .Using the sets of re-estimated parameters ( b ), it is possible to characterize the empirical distribution of each estimate, calculate the variance, and construct confidence intervals for each parameter.The resulting uncertainty around the model fit can similarly be obtained from f t, 1 , f t, � 2 , . . ., f (t, � B ) .We characterize the uncertainty using 300 bootstrap realizations (i.e., parameter <B> = 300 in the options.mfile).

Model-based forecasts with quantified uncertainty
Forecasting the model f t, , h days ahead is based on the estimate f (t + h, �) .The uncertainty of the fore- casted value can be obtained using the previously described parametric bootstrap method.Let denote the forecasted value of the current state of the system propagated by a horizon of h time units, where b denotes the estimation of parameter set from the b th bootstrap sample.We can use these values to calculate the bootstrap variance to measure the uncertainty of the forecasts and use the 2.5% and 97.5% percentiles to construct the 95% prediction intervals (95% PIs).We can set the forecasting horizon using the parameter <fore-castingperiod1> in the options_forecast.m file.The structure of the options_forecast.mfile is described in Supplementary Text 2.

Sub-epidemic wave model
We use a spatial wave model with up to 5 parameters that aggregate linked overlapping sub-epidemics [13].This sub-epidemic framework can characterize diverse epidemic patterns, including the epidemic plateaus, where the epidemic stabilizes at a high level for an extended period and the epidemic waves have multiple peaks.The strength (e.g., weak vs. strong) of their overlap determines when the next sub-epidemic is triggered and is controlled by the onset threshold parameter, C thr .The mathemati- cal equation for the sub-epidemic building block is the 3-parameter generalized-logistic growth model (GLM), which is specified by setting the parameter <flag1>=1 in the options.mfile.This growth model has performed well in short-term forecasts of single outbreak trajectories for different infectious diseases, including COVID-19 [22][23][24].Alternative growth equations to model the sub-epidemic building block include the 3-parameter Richards model (<flag1>=4) and the 2-parameter logistic growth model (<flag1>=2).The following differential equation gives the generalized-logistic growth model (GLM): where C(t) denotes the cumulative curve at time t, and dC(t) dt describes the epidemic's incidence curve over time t.The positive parameter r denotes the growth rate per unit of time, K 0 is the final outbreak size, and p ∈ [0,1] is the "scaling of growth" parameter which allows the model to capture early sub-exponential and exponential growth patterns.If p = 0 , this equation describes a con- stant incidence over time, while p = 1 indicates that the early growth phase is exponential.Intermediate values of p(0 < p < 1) describe early sub-exponential (e.g., poly- nomial) growth dynamics.The sub-epidemic wave model consists of a system of coupled differential equations: Here, C i (t) is the cumulative number of infections for sub-epidemic i , and K i is the size of the i th sub-epidemic where i = 1, . . ., n .Starting from an initial sub-epidemic size K 0 , the size of consecutive sub-epidemics K i decline at the rate q following an exponential or power-law function as described below.Hence, a total of 5 parameters (r, p, C thr , q, K 0 ) for i = 1, . . ., n are needed to charac- terize a sub-epidemic wave composed of two or more sub-epidemics.
The onset timing of the subsequent (i + 1) th sub-epi- demic is determined by the indicator variable A i (t) .This results in a coupled system of sub-epidemics where the (i + 1) th sub-epidemic is triggered when the cumulative curve for the i th sub-epidemic exceeds a total of C thr .The sub-epidemics overlap because the ( i + 1) th sub-epidemic takes off before the i th sub-epidemic completes its course.That is, The threshold parameters are defined so that 1 ≤ C thr < K 0 and A 0 (t) = 1 for the first sub-epidemic.
The maximum number of sub-epidemics considered in the epidemic wave trajectory is specified using parameter <npatches_fixed> in the options.mfile.Here, we set <npatches_fixed>=3.The initial number of cases is given by C 1 (0) = I 0 , where I 0 is the initial number of cases in the observed data.
In this framework, the size of the subsequent i th sub- epidemic ( K i ) remains steady or declines due to the effects of behavior changes or interventions.We consider both exponential and inverse decline functions to model the size of consecutive sub-epidemics described below.

Exponential decline of sub-epidemic sizes
If consecutive sub-epidemics follow exponential decline, then K i is given by: where K 0 is the size of the initial sub-epidemic (K 1 = K 0 ) .If q = 0 , the model predicts an epidemic wave composed of sub-epidemics of the same size.When q > 0 , the epidemic wave is composed of a finite number of sub-epidemics given by n tot which is a func- tion of C thr , q, andK 0 as follows: where the brackets ⌊ * ⌋ denote the largest integer that is smaller than or equal to * .The total size of the epidemic wave composed of n tot overlapping sub-epidemics has the following closed-form solution: The exponential sub-epidemic decline function can be selected by setting the parameter <typedecline2>=1 in the options.mfile.

Power-law decline of sub-epidemic sizes
If consecutive sub-epidemics decline according to the inverse function, we have: When q > 0 , the total number of sub-epidemics n tot comprising the epidemic wave is finite and given by: The total size of an epidemic wave is given by the aggregation of n tot overlapping sub-epidemics: The power-law sub-epidemic decline function can be selected by setting the parameter <typedecline2> = 2 in the options.mfile.Selecting the type of decline function that yields the best fit to the data is also possible by setting the parameter, <typedecline2>=[1 2].

Fixed sub-epidemic onset
We can also consider sub-epidemic wave models with a fixed onset time at 0. In this case, all sub-epidemics start at time 0, and the threshold parameter C thr drops from the model.We use parameter <onset_fixed> in the options.mfile to specify whether the onset timing of the sub-epidemics is fixed at time 0 (<onset_fixed>=1) or not (<onset_fixed>=0).

Top-ranked sub-epidemic models
To select the top-ranked sub-epidemic models, we analyze the Akaike information criterion ( AIC c ) values of the set of best-fit sub-epidemic wave models with different values of C thr .The AIC c is given by [25,26]: where m is the number of model parameters, and n d is the number of data points.Specifically for normal distribution, the AIC c is where is the sum of squared errors, m is the number of model parameters including parameter C thr .Parameter <topmodelsx> in the options.mfile is used to specify the number of topranked models that will be generated and used to derive ensemble models.

Plotting simulations of the spatial wave sub-epidemic model
Before fitting the growth model to the data, it is useful to check that the selected model yields simulations broadly consistent with the range of the time series data by generating model simulations with different parameter values.For example, if data show systematic differences that contrast with the model solutions, it may suggest that the model is not the best choice for the data at hand.
The function plot_SW_subepidemic.m can be used to plot model solutions where the user provides the type of growth model by passing parameter <flag1> (generalized-logistic growth model, Richards, Gompertz, etc.), the model parameter values, and the initial conditions as passing input parameters to the function in the following order: <flag1>, r , p , a, K , q, n, C thr ,<type-decline1>, C(0) , and finally the duration of the simulation.For example, the following call plots a simulation of the spatial wave sub-epidemic model using as building block the generalized logistic growth model Of note, in the above call, the value of parameter a is passed empty ([]) since the generalized logistic growth model does not use this parameter.This function will generate a figure (Fig. 3A) that shows the corresponding model solution dC(t)/dt .Additional representative sim- ulations with other values of the C thr are shown in Fig. 3.
In the next section, we describe four comprehensive performance metrics that can be used to assess both calibration and forecasting performance.Specifically, the mean absolute error (MAE) and the mean squared error (MSE) are used to assess the performance of point forecasts, while the coverage of the 95% prediction interval (95% PI) and the weighted interval score (WIS) evaluate the performance of distributional forecasts by accounting for uncertainty in model fit and predictions.

Performance metrics
To assess the performance of the models during the calibration or forecasting periods, we used four performance metrics: the mean absolute error (MAE), the mean squared error (MSE), the coverage of the 95% prediction intervals (95% PI), and the weighted interval score (WIS) [27].While it is possible to generate h-time units ahead forecasts of an evolving process, those forecasts looking into the future can only be evaluated until sufficient data for the h-time units ahead has been collected.In the options_forecast.mfile, the parameter <get-performance> is a Boolean variable (0/1) to indicate whether the user wishes to compute the performance metrics of the forecasts when sufficient data is available.
The mean absolute error (MAE) is given by: where t h are the time points of the time series data [28], and N is the calibration or forecasting period length.Similarly, the mean squared error (MSE) is given by: where t h are the time points of the time series data [28], and N is the calibration or forecasting period length.The coverage of the 95% prediction interval (PI) corresponds to the fraction of data points that fall within the 95% PI, calculated as where L t h and U t h are the lower and upper bounds of the 95% PIs, respectively, Y t h are the data and 1 is an indicator variable that equals 1 if Y t h is in the specified interval and 0 otherwise.
The weighted interval score (WIS) [27,29], which is a proper score recently embraced for quantifying model forecasting performance in epidemic forecasting studies [30][31][32][33], provides quantiles of predictive forecast distribution by combining a set of Interval Scores (IS) for probabilistic forecasts.An IS is a simple proper score that requires only a central (1 − α) × 100% PI [27] and is described as In this Eq. 1 refers to the indicator function, meaning that 1 y < l = 1 if y < l and 0 otherwise.The terms l and u represent the α 2 and 1 − α 2 quantiles of the forecast F. The IS consists of three distinct quantities: 1.The sharpness of F , given by the width u − l of the cen- tral (1 − α) × 100% PI. 2. A penalty term 2 α × l − y × 1 y < l for the obser- vations that fall below the lower end point l of the (1 − α) × 100% PI.This penalty term is directly pro- portional to the distance between y and the lower end l of the PI.The strength of the penalty depends on the level α.
3. An analogous penalty term 2 α × y − u × 1 y > u for the observations falling above the upper limit u of the PI.
To provide more detailed and accurate information on the entire predictive distribution, we report several central PIs at different levels along with the pre- dictive median, ∼ y , which can be seen as a central predic- tion interval at level 1 − α 0 → 0 .This is referred to as the WIS, and it can be evaluated as follows: where, w k = α k 2 for k = 1,2, . . ..K and w 0 = 1 2 .Hence, WIS can be interpreted as a measure of how close the entire distribution is to the observation in units on the scale of the observed data [31,34].

Doubling times
Doubling times characterize the sequence of times at which the cumulative incidence doubles.We denote the times at which cumulative incidence doubles by t d j , such that 2C(t d j ) = C(t d j+1 ) where t d 0 = 0, C t d 0 = C 0 , j = 1,2, 3, . . ., n g and n g is the total number of times cumulative incidence doubles [35].The actual sequence of "doubling times" is defined as follows: For exponential growth, doubling times remain invariant and are given by (ln2)/r , whereas the doubling times increase when the growth pattern follows sub-exponential growth [36].We can characterize the doubling times and their uncertainty from the best-fit model f t, [37].We can evaluate the uncertainty of the sequence of doubling times and the overall doubling time using the model parameter estimates derived from bootstrapping b , where b = 1,2, 3, . . ., B .That is, d j b provides a sequence of doubling times for a set of bootstrap parameter estimates, b , where b = 1,2, 3, . . ., B .We can use these curves to derive 95% CIs for the sequence of doubling times and quantify the probability of observing a given number of doublings.

Constructing ensemble forecasts from top-ranking models
Ensemble models that combine the strength of multiple models may exhibit significantly enhanced predictive performance (e.g [11,17,38,39]).An ensemble model derived from the top-ranking I models is denoted by the Ensemble(1), illustrated in Fig. 4. Thus, Ensemble(2) and Ensemble(3) refer to the ensemble models generated from the combination of the top-ranking 2 and 3 models, respectively.The ensemble models can be derived from the unweighted (equal weights across contributing individual models) or a weighted combination of the highest-ranking sub-epidemic models based on the quality of fit as deemed by the AIC c i for the i-th model where AIC c 1 ≤ • • • ≤ AIC c I and i = 1, …, I.In this case, we com- pute the weight w i for the i-th model, i = 1, …, I, where w i = 1 as follows: and hence w I ≤ . . .≤ w 1 .
The estimated mean curve of daily COVID-19 cases for the Ensemble(I) model is: where given the training data, � (i) denotes the set of estimated parameters, and denotes the estimated mean curve of daily COVID-19 cases, for the i-th model.Accordingly, we compute the weighted average and sample the bootstrap realizations of the forecasts for each model to construct the 95% CI or PI using the 2.5% and 97.5% quantiles [17].Alternatively, we can set the ensemble weights based on different calibration performance metrics for the top-ranked models.For instance, we can make the ensemble weights proportional to the relative likelihood ( l ) rather than the reciprocal of the AIC c .Let AIC min denote the minimum AIC from the set of models.The relative likelihood of model i is given by l i = e ((AIC min −AIC i )/2) [40].We compute the weight w i for the i-th model where w i = 1 as follows: and hence w I ≤ . . .≤ w 1 .In the options_forecast.mfile, we can specify four types of ensemble weights using <weight_type1>.Specifically, unweighted (<weigth_type1>=-1), weighted according to the AIC c (<weight_type1>=0), weighted based on the relative likelihood (weight_ type1=1), weighted based on the reciprocal of the WIS metric of the calibration period (<weight_type1>=2).
In the options_forecast.mfile, we can specify the parameters related to the epidemic forecasts, including the forecasting horizon and the type of ensemble weights (Fig. 5).

Results and discussion
The dataset The time series data file is a text file with the extension *.txt in the input folder.The data file can contain one or more incidence time series (one per column).Each column corresponds to the incidence curve over time for each epidemic corresponding to a different area/group.For instance, each column could contain time series data corresponding to different U.S. states or countries worldwide.In the options.mfile, a specific data column can be accessed for inference using the parameter <outbreakx>.If the time series file contains cumulative incidence count data, the name of the file containing the time series data starts with "cumulative" according to the following format: cumulative-< cadtemporal>-<caddisease>-<datatype>-<cadregion>-<caddate1>.txt.
where <cadtemporal> is a string parameter that indicates the temporal resolution of the data (e.g., daily, weekly, yearly).Parameter <caddisease> is a string used to indicate the name of the disease related to the time series data, <datatype> is a string parameter indicating the nature of the data (e.g., cases, deaths, and hospitalizations), whereas <cadregion> is a string parameter indicating the geographic region of the time series contained in the file (New York, USA, World, Asia, Africa).Finally, <caddate1> is a string to indicate the date for the most recent observation in the data file with the format: mm-dd-yyyy.
To illustrate the methodology presented in this tutorial paper, we used daily COVID-19 cases reported in the USA from the publicly available data tracking system of the Johns Hopkins Center for Systems Science and Engineering (CSSE) [41].The data is also publicly available in the GitHub repository [42].An example of a data file that we will use in this tutorial is provided in Fig. 6.
If the time series file contains incidence data, the name of the data file does not start with the word 'cumulative' and follows the format:

< c a d t e m p o r a l > -< c a d d i s e a s e > -<datatype>-<cadregion>-<caddate1>.txt
For example: daily-coronavirus-cases-USA-05-11-2020.txt In the options.mfile, the parameter <datevec-first1> is a 3-value vector that specifies the date corresponding to the first data point in time series data in the format: [yyyy mm dd].Similarly, the parameter <datevecend1> is a 3-value vector that specifies the date of the most recent data file in the format: [yyyy mm dd].The file.cumulative-<cadtemporal>-<caddisease>-<datatype>-<cadregion>-<datevecend1>.txt in the input folder with the date <datevecend1> contains the most recent time series data and is needed to assess forecast performance.Finally, the parameter <DT> is an integer indicating the temporal resolution of the time series data (e.g., <DT> = 1 for daily data; <DT> = 7 for weekly data) (Fig. 7).

Data adjustments Data smoothing
To reduce the noise in the original data due to artificial reasons such as the weekend effects, we can smooth out the time series data using the moving average of the time series whereby <smoothfactor1> is a parameter in the options.mfile that specifies the span of the moving average (e.g., <smoothfactor1> = 1 implies no smoothing applied to the data).Let y t j = y t 1 , y t 2 , . . ., y t n d where j = 1,2, . . ., n d Fig. 6 Example data file named cumulative-daily-coronavirus-cases-USA-05-11-2020.txtlocated in the input folder.A partial view in Excel of the contents of the data file is shown denote the smoothed time series of the epidemic trajectory based on the moving average.Here, t j are the time points for the time series data, n d is the number of obser- vations, and each y t j , j = 1, 2, . . ., n d , correspond to the smoothed time series.We recommend that the user set the average to multiples of seven to reduce the weekend effects in the reported data.
For the daily COVID-19 case data employed for illustration purposes, we set <smoothfactor1> = 7 and smooth out the daily series using a 7-day moving average to reduce the noise in the original data due to artificial reasons such as the weekend effects.

Calibration period
To fit the models to the most recent observations in a time series file, we can specify the length of the calibration period whereby <calibrationperiod1> indicates the number of recent data points that will be used to calibrate the models.If <calibrationpe-riod1> exceeds the length of the time series in the data file, the calibration period is set to the maximum length of the available data.

Fitting the sub-epidemic wave models to data with quantified uncertainty
To fit the sub-epidemic wave models to the data with quantified uncertainty, we need to run the function These output internal files are needed to plot model fits, derive parameter estimates, generate short-term forecasts, and quantify the calibration and forecasting performance metrics.

Plot the mean model fits and quality of fit metrics for the top-ranked models
After running the function Run_SW_subepidemic Framework.mwith the desired input parameters, we can use the function plotRankings_SW_subepidemic Framework.m to plot the mean model fits of the top-ranking models including their sub-epidemic profiles and the associated quality of model fit metrics including the AIC c , the rela- tive likelihood, and the evidence ratio based on the inputs.However, this function can also receive <outbreakx> and <caddate1> as passing input parameters while the remaining inputs are obtained from the options.mfile.Running this function from MATLAB's command line, we have: >>plotRankings_SW_subepidemicFramework(52,'05-11-2020') Figures 9 and 10 illustrate the outputs obtained from this function call.Figure 9 shows the mean model fits of the top-ranked sub-epidemic models, which indicates that the 1st-ranked model consists of 3 sub-epidemics.In contrast, the 2nd, 3rd, and 4th -ranked sub-epidemic models consist of 2 sub-epidemics.It is important to note that there was severe underreporting of cases during the early phase of the epidemic.The corresponding goodness of fit statistics of the top-ranked models, including the AIC c , the relative likelihood, and the evidence ratio, are shown in Fig. 10.It also saves the AIC c values of the top- ranked models in the following .csvfile: A I C c -t o p R a n k e d -o n s e t f i x e d -0 -t ypedecline-2-flag1-1-method-0-dist-0 -d a i l y -c o r o n a v i r u s -c a s e s -U S Aarea-52-05-11-2020.csv.

Plot the model fits, parameter estimates, and performance metrics of the top-ranking models
Using the function plotFit_SW_subepidemic Framework.m,we can plot the fits of the top-ranking models, including their sub-epidemic profiles, parameter estimates, and residual plots based on the inputs indicated in the options.mfile.However, this function can also receive <outbreakx> and <caddate1> as passing input parameters while the remaining inputs are obtained from the options.mfile.
In addition, this function also plots the empirical distributions of the parameters associated with each of the top-ranking models and the calibration performance metrics (MSE, MAE, 95% P.I., and WIS).Finally, this The model captures the entire epidemic period well, including the broad peak dynamics, by integrating three asynchronous sub-epidemics.The best model fit (solid red line) and 95% prediction interval (dashed red lines) are shown.The cyan curves correspond to the associated uncertainty from individual bootstrapped curves, which are used to derive the 95% prediction intervals.The sub-epidemic mean profiles obtained from the parametric bootstrapping with 300 bootstrap realizations are shown in the center panels.The red, blue, and green curves represent the three sub-epidemic profiles, and the grey curves are the estimated aggregate epidemic trajectories.Black circles correspond to the data points.The empirical distributions of the parameters and the corresponding estimates are shown in the top panels.The residuals are also shown function also outputs .csvfiles in the output folder with the calibration performance metrics, the parameter estimates associated with the top-ranking models, the corresponding Monte Carlo standard errors of the parameters, and the estimated sequence of doubling times for each of the top-ranked models.Using the default parameter values indicated in the options.mfile, the actual call to this function from MATLAB's command line follows: >> plotFit_SW_subepidemicFramework Figures 11 and 12 illustrate the outputs from the above call to the function.The fits of the 1st and 2nd ranked sub-epidemic models, including the sub-epidemic profiles and residuals, to the daily curve of COVID-19 cases are shown in Figs.11 and 12.These models yield a similarly good fit to the data.The figures also include the empirical distribution of the parameter estimates.These parameter estimates are well identified as the confidence intervals lie in a well-defined range of values [13].The calibration performance metrics capturing the quality of fit of the top-ranked sub-epidemic models are also displayed in Fig. 13.For instance, this figure indicates that the coverage of the 95% PIs varied little between ~ 93% and 95% for the top-ranked models.This function will store the following .csvfiles in the output folder:  A relevant issue to investigate when using any mathematical model is that of structural or practical parameter identifiability [43].Structural identifiability arises when one or more model parameters cannot be uniquely estimated using the model, even when the data is free of noise.That is, a lack of structural identifiability is due to issues in the model structure, such as the presence of parameter correlations [12].On the other hand, practical identifiability occurs when one or more parameters cannot be reliably estimated using the available observed data, which could be associated with the number of observations available for model calibration and the spatial-temporal resolution of the data.Because the time series of incident cases in the observed epidemic wave is an aggregation of overlapping sub-epidemics, there could be instances when different sub-epidemic profiles may give rise to indistinguishable aggregated epidemic waves as noted elsewhere [44].

Generate the top-ranked and ensemble sub-epidemic model forecasts and the associated forecasting performance metrics
Using the function plotForecast_SW_subepidemic Framework.m,we can plot the short-term forecasts from the top-ranking sub-epidemic models and the ensemble models derived from the top-ranking sub-epidemic models based on the inputs indicated in the options.m and the options_forecast.mfiles.However, this function can also receive parameters < outbreakx>, <caddate1>, or <forecasting period> as passing input parameters while the remaining inputs are read from the options.m and options_ forecast.mfiles.Moreover, the data associated with each top-ranked model and ensemble forecasts are saved as .csvfiles in the output folder.
In addition, this function also plots the forecasting performance metrics (MSE, MAE, 95% P.I., WIS) for the topranking models and the ensemble sub-epidemic wave models.Finally, this function also stores *.csv files in the output folder with the forecasting performance metrics associated with the top-ranking and ensemble models, and the estimated doubling times for each of the topranked models.Using the default parameter values indicated in the options.m,and options_forecast.mfiles, the call to this function from MATLAB's command line follows: >>plotForecast_subepidemicFramework Figures 14 and 15 illustrate the outputs obtained from this function call.Figure 14 shows the 30-day forecasts derived from the top-ranking sub-epidemic models, whereas Fig. 15 shows the sub-epidemic profiles of the forecasts.These forecasts indicate that the 1st-ranked model outperformed the other top-ranked models.Moreover, the data associated with the top-ranked model forecasts are also saved as .csvfiles in the output folder.The forecasting performance metrics for the top-ranked models are displayed in Fig. 16, and these metrics are also saved in a .csvfile in the output folder.In comparison, the forecast derived from the simpler growth model consisting of a single sub-epidemic (<npatches_ fixed>=1) was substantially worse, as shown in Supplementary Fig. 2.
The corresponding 3 ensemble forecasts (Ensemble(2), Ensemble(3), and Ensemble(4)) derived from the weighted combination of the top-ranked models based on their relative likelihood or Akaike weights (e.g., < weight_type1>=1 in the options_forecast.mfile) are shown in Fig. 17.Also, the corresponding forecasting performance metrics for the ensemble models are shown in Fig. 18 and are saved in a .csvfile in the output folder.The Ensemble(4) performed slightly better than the Ensemble(2) and Ensemble(3) models in terms of the WIS and coverage of the 95% prediction interval.This function will store the following .csvfiles in the output folder: 1) Forecasting performance metrics of the top-ranked models:   We can also compare the performance of unweighted ensemble models by setting the parameter <weight_ type1>=-1 in the options_forecast.mfile while the other parameters are kept unchanged.Then we can compare the performance of the unweighted ensemble models (equal weights across top-ranked models) with the weighted ensemble models, where the weights are proportional to the relative likelihood of the models (<weight_type1>= 1).We can run the function plotForecast_subepidemicFramework.m to generate the new set of forecasts with the new models.
The forecasting performance metrics for the weighted and unweighted ensemble models and other statistical time-series models are displayed in Table 2. Overall, the unweighted ensemble models performed similarly as their weighted ensemble counterparts for this forecast and outperformed some popular statistical time-series models such as ARIMA (a brief description of the statistical models is given in Supplementary Text S3).

Conclusion
We have introduced a MATLAB toolbox to fit and forecast time series using the spatial wave sub-epidemic model originally developed to generate short-term forecasts of epidemics [13] and illustrated its functionality using time-series data of the COVID-19 pandemic in the US.In particular, the sub-epidemic model used in this tutorial has shown competitive performance in characterizing and forecasting epidemic trajectories of infectious diseases such as COVID-19, Ebola, and plague [13,15].The toolbox can be a helpful resource for policy makers and used as a part of the curriculum for students training in infectious disease modeling, mathematical biology, applied statistics and mathematics, and special courses in epidemic modeling and time-series forecasting.
This new open-source toolbox and associated tutorial will be helpful to a broad community of applied scientists interested in characterizing and forecasting time-series data that results from the aggregation of Table 2 Forecasting performance metrics derived from the weighted and unweighted ensemble models, an auto-regressive integrated moving average model (ARIMA), a generalized additive model (GAM), and simple linear regression model (SLR) based on the daily curve of COVID-19 cases in the USA from 11-May-2020 to 11-June-2020.The weights of the weighted ensemble model are based on relative likelihood.Overall, both ensemble types performed similarly for this forecast, and outperformed the simple statistical models multiple asynchronous underlying growth processes.Moreover, prior publications have extensively validated the tools presented here [13,15].The models and methods included in the toolbox have improved short-term forecasting performance over simpler growth models such as the Richards and generalized-logistic growth models.Moreover, we have ensured publicly available, long-term, and stable hosting of the toolbox in a public GitHub repository.Extensions to the toolbox could include additional components, such as new model features, alternative estimation methods, and additional forecasting performance metrics.

Fig. 1
Fig. 1 Contents of options.m file, the values of the parameters related to the parameter estimation method and parametric bootstrapping

Fig. 2 Fig. 3
Fig. 2 Contents of options.m file, the values of the parameters related to the sub-epidemic wave model and the number of top-ranked sub-epidemic wave models

Fig. 4
Fig. 4 Schematic diagram of the construction of the ensemble model from the weighted combination of the highest-ranking sub-epidemic models as deemed by the AIC c k for the k-th model where AIC c1 ≤ • • • ≤ AIC c K and k = 1, …, K.An ensemble derived from the top-ranking K models is denoted by Ensemble(K)

Fig. 5
Fig. 5 Contents of options_forecast.m file, that specify the parameters related to the epidemic forecasts including the forecasting horizon and the type of ensemble weights (e.g., unweighted, weighted based on AIC c , weighted based on the relative likelihood of the models, and weighted based on the WIS of the calibration period)

Fig. 7
Fig. 7 Contents of options.m file, and the values of the parameters related to the data including the temporal resolution of the time series data

Fig. 8 Fig. 9 Fig. 10
Fig. 8 Contents of options.m file, the values of the parameters related to smoothing and calibration period

Fig. 11
Fig. 11 Fit of the 1st-ranked sub-epidemic wave model to the daily curve of COVID-19 cases in the USA from 27-Feb-2020 to 11-May-2020.The model captures the entire epidemic period well, including the broad peak dynamics, by integrating three asynchronous sub-epidemics.The best model fit (solid red line) and 95% prediction interval (dashed red lines) are shown.The cyan curves correspond to the associated uncertainty from individual bootstrapped curves, which are used to derive the 95% prediction intervals.The sub-epidemic mean profiles obtained from the parametric bootstrapping with 300 bootstrap realizations are shown in the center panels.The red, blue, and green curves represent the three sub-epidemic profiles, and the grey curves are the estimated aggregate epidemic trajectories.Black circles correspond to the data points.The empirical distributions of the parameters and the corresponding estimates are shown in the top panels.The residuals are also shown

Fig. 12
Fig. 12 Fit of the 2nd-ranked sub-epidemic wave model to the daily curve of COVID-19 cases in the USA from 27-Feb-2020 to 11-May-2020.The model captures the entire epidemic period well, including the broad peak dynamics, by integrating three asynchronous sub-epidemics.The best model fit (solid red line) and 95% prediction interval (dashed red lines) are shown.The cyan curves correspond to the associated uncertainty from individual bootstrapped curves, which are used to derive the 95% prediction intervals.The sub-epidemic mean profiles obtained from the parametric bootstrapping with 300 bootstrap realizations are shown in the center panels.The red, blue, and green curves represent the two sub-epidemic profiles, and the grey curves are the estimated aggregate epidemic trajectories.Black circles correspond to the data points.The empirical distributions of the parameters and the corresponding estimates are shown in the top panels.The residuals are also shown

Fig. 17 E n s e m b l e ( 2 )
Fig.17 30-day sub-epidemic ensemble model forecasts (Ensemble(2), Ensemble(3), Ensemble(4)) of COVID-19 cases in the USA from 11-May-2020 to 11-June-2020.Circles correspond to the data points.The model fits (solid line), and 95% prediction intervals (shaded area) are shown.The vertical line indicates the start time of the forecast.Of note, the data associated with each ensemble model forecast are also saved as .csvfiles in the output folder

Table 1
Description of the user functions available in the SpatialWavePredict toolbox Function Roleoptions.mSpecifies the parameters related to model fitting, including the characteristics of the time series data, the sub-epidemic model, parameter estimation method, error structure, smoothing, and calibration period.The structure of the options.mfile is given in Supplementary Text 1.